arXiv:1509.06580v2 [cs.IT] 22 Jan 2016 


Graph-Based Lossless Markov Lumpings 

Bernhard C. Geiger*, Christoph Hofer-Temmel†
* Institute for Communications Engineering, TU München, Germany
† Faculteit Militaire Wetenschappen, Nederlandse Defensie Academie, The Netherlands
geiger@ieee.org, math@temmel.me


Abstract: We use results from zero-error information theory to determine the set of non-injective functions through which a Markov chain can be projected without losing information. These lumping functions can be found by clique partitioning of a graph related to the Markov chain. Lossless lumping is made possible by exploiting the (sufficiently sparse) temporal structure of the Markov chain. Eliminating edges in the transition graph of the Markov chain trades the required output alphabet size versus information loss, for which we present bounds.

I. Introduction 

Large Markov models, common in many scientific disciplines, present a challenge for analysis, model parameter learning, and simulation: Language n-gram models [1, Ch. 6] and models in computational chemistry and systems biology [2], for example, belong to this category. For these models, efficient simulation methods are as important as ways to represent the model with fewer parameters. A popular approach for the latter is lumping, i.e., replacing the alphabet of the Markov chain by a smaller one via partitioning. This partition induces a non-injective lumping function from the large to the small alphabet. While, in general, the lumped process has a lower entropy rate than the original chain, in [3] we presented conditions for lossless lumpings, i.e., where the original Markov chain and the lumped process have equal entropy rates. Specifically, the single entry property we define in [3, Def. 3] holds if, given the previous state of the Markov chain, in the preimage of the current lumped state only a single state is realizable, i.e., has positive probability (see Fig. 1).

The emphasis on whether a state is realizable, rather than on its probability, is also common in zero-error information theory. Typical problems in zero-error information theory are error-free communication [4] (rather than communication with small error probabilities) and lossless source coding with side information [5]. Both problems admit elegant graph-theoretic approaches which we recapitulate in Section II.

In Section III we use these graph-theoretic approaches to find lossless lumpings for a given Markov chain. While the current state of the Markov chain cannot be inferred from its lumped image only, we require that it can be reconstructed by using the previous state of the Markov chain as side information (cf. Fig. 1). The lumpings fulfilling this requirement correspond to the possible clique partitions of a graph derived from the Markov chain. The method is universal in the sense that it only depends on the presence, but not the precise magnitude, of state transitions of the Markov chain. In Section IV we relax the problem and reduce the output alphabet size of the lumping function by accepting that the lumped process has an entropy rate smaller than the original chain. We furthermore present bounds on the difference between these entropy rates.

Fig. 1. The transition graph of an irreducible, aperiodic Markov chain with alphabet $\mathcal{X} = \{1,2,3,4\}$. The partition indicated by the red boxes induces a lumping function $g$, with $g(1) = g(2) = 1'$ and $g(3) = g(4) = 2'$. While $g$ is not invertible, side information about the previous state allows one to determine the current state given only the lumped state: If the previous state is 1 and the current lumped state is $2'$ (box on the left), only state 3 is realizable.

By design, lossless lumpings are not efficient source codes. Thus, it cannot be assumed that the reduced output alphabet size is related to the Markov chain's entropy rate. Nevertheless, in Section V we evaluate our lossless lumping method from a source coding perspective by applying it to length-$K$ sequences of the original Markov chain. We show that the required size of the output alphabet never exceeds (and asymptotically approaches) the number of realizable length-$K$ sequences. Our lossless lumping method is thus an asymptotically optimal fixed-length, lossless source code.

Future work shall apply the presented lumping methods to practical examples from, e.g., chemical reaction networks or natural language processing. Furthermore, while the connection between lossless lumpings and zero-error information theory is interesting and revealing, searching for lossless lumping functions via clique partitioning can be computationally expensive. We have reasons to believe that the search for lumping functions can be cast as a constrained optimization problem whose properties are currently under investigation. Finally, we believe that the results presented in this work can contribute to zero-error source coding of processes with memory, complementing available results on zero-error coding for channels with memory (see [6] and the references therein). Section VII hints at first results.






II. Preliminaries from Zero-Error Information 
Theory 

Throughout this work, log denotes the natural logarithm, 
i.e., entropies and entropy rates are measured in nats. 

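Let $\mathbf{X} := (X_n)_{n\in\mathbb{N}}$ be an irreducible, aperiodic, stationary Markov chain with finite alphabet $\mathcal{X} := \{1,\dots,N\}$, transition probability matrix $P$, and invariant distribution vector $\mu$. The adjacency matrix $A$ is defined by $A_{x,x'} := \lceil P_{x,x'} \rceil$. We say a state $x$ can access another state $x'$, if $P_{x,x'} > 0$ ($A_{x,x'} = 1$). We abbreviate $X_m^n := \{X_m, X_{m+1}, \dots, X_n\}$. The $K$-fold blocked process $\mathbf{X}^{(K)}$ given by $X_n^{(K)} := X_{(n-1)K+1}^{nK}$ is also Markov. Every length-$K$ sequence of $\mathbf{X}$ is a state of $\mathbf{X}^{(K)}$.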

Let $\mathcal{G} := (\mathcal{X}, E)$ be a graph with vertices $\mathcal{X}$ and edges $E \subseteq [\mathcal{X}]^2$, where $[\mathcal{X}]^2$ is the set of two-element subsets of $\mathcal{X}$. A set $S \subseteq \mathcal{X}$ is a clique, if $[S]^2 \subseteq E$, and an independent set, if $[S]^2 \cap E = \emptyset$. The clique number $\omega(\mathcal{G})$ and independence number $\alpha(\mathcal{G})$ are the sizes of $\mathcal{G}$'s largest clique and independent set, respectively. A clique partition of $\mathcal{G}$ is a partition of $\mathcal{X}$ into cliques of $\mathcal{G}$. The clique partition number $\gamma(\mathcal{G})$ is the size of the smallest clique partition of $\mathcal{G}$. The chromatic number $\chi(\mathcal{G})$ is the minimum number of colours needed to paint $\mathcal{X}$ without having same-coloured neighbours.

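The complement graph $\overline{\mathcal{G}}$ has vertex set $\mathcal{X}$ and edge set $[\mathcal{X}]^2 \setminus E$. Edge-duality identifies cliques of $\mathcal{G}$ and independent sets of $\overline{\mathcal{G}}$ and vice-versa, whence $\omega(\mathcal{G}) = \alpha(\overline{\mathcal{G}})$ and $\gamma(\mathcal{G}) = \chi(\overline{\mathcal{G}})$. For further details on graph theory see [7].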

Let $\mathcal{Y} := \{1,\dots,M\}$. We consider a discrete, memoryless channel (DMC) with input alphabet $\mathcal{X}$ and output alphabet $\mathcal{Y}$, defined by the transition probability matrix $W$, where $W_{x,y} := \Pr(Y_n = y \mid X_n = x)$. In the case of a deterministic channel, i.e., one in which $W_{x,y} \in \{0,1\}$ and $M < N$, we can describe the channel by a lumping function $g\colon \mathcal{X} \to \mathcal{Y}$ and call $\mathbf{Y}$ defined by $Y_n := g(X_n)$ the lumped process.

Definition 1. Let $\mathcal{G}_W := (\mathcal{X}, E_W)$ be the (channel) confusion graph, where

$$\{x_1, x_2\} \in E_W \Leftrightarrow \exists y\colon \lceil W_{x_1,y} \rceil \cdot \lceil W_{x_2,y} \rceil = 1. \tag{1}$$

In the case of a deterministic channel, i.e., a lumping, denote the confusion graph by $\mathcal{G}_g := (\mathcal{X}, E_g)$.

The confusion graph connects two vertices if the channel confuses them with positive probability, i.e., if there exists at least one element in the output alphabet to which both inputs can be mapped. If the channel is deterministic, then the confusion graph $\mathcal{G}_g$ has a simple structure.

Lemma 1. The confusion graph $\mathcal{G}_g$ consists of isolated cliques induced by the preimages of the lumping function $g$. Hence,

$$E_g = \bigcup_{y \in \mathcal{Y}} [g^{-1}(y)]^2. \tag{2}$$
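As a small illustration of Lemma 1, the following Python sketch builds the edge set of the confusion graph from the preimages of a lumping function. The function name `confusion_edges` and the dictionary encoding of $g$ are assumptions made for this example only; the lumping itself is the one stated in the caption of Fig. 1.

```python
from itertools import combinations

def confusion_edges(g):
    """Edge set E_g of the confusion graph of a lumping g (dict: state -> lumped state)."""
    preimages = {}
    for x, y in g.items():
        preimages.setdefault(y, set()).add(x)
    # Union over all outputs y of the two-element subsets of the preimage g^{-1}(y).
    return {frozenset(pair)
            for block in preimages.values()
            for pair in combinations(sorted(block), 2)}

# Lumping from the caption of Fig. 1: g(1) = g(2) = 1' and g(3) = g(4) = 2'.
g = {1: "1'", 2: "1'", 3: "2'", 4: "2'"}
edges = confusion_edges(g)
```

Each preimage contributes one isolated clique, so no edge ever joins two states with different images.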

The confusion graph is exactly the graph used in Shannon's original paper [4] and the complement of the graph in [?, Sec. III]. The confusion graph determines the zero-error capacity $C(W)$ of the channel. The number of messages that can be transmitted reliably via one channel use is the independence number of its confusion graph, $\alpha(\mathcal{G}_W)$. For $K$ channel uses, one requires the $K$-fold normal product of $\mathcal{G}_W$ with itself: $\mathcal{G}_W^K := (\mathcal{X}^K, E_W^K)$, where $\{x_1^K, x_1'^K\} \in E_W^K$ if, for all $i = 1,\dots,K$, either $x_i = x_i'$ or $\{x_i, x_i'\} \in E_W$. In the limit, one has the zero-error capacity $C(W) := \sup_K \frac{1}{K} \log \alpha(\mathcal{G}_W^K) \ge \log \alpha(\mathcal{G}_W)$. In the case of a deterministic channel, the number of messages that can be transmitted reliably in one channel use is $\alpha(\mathcal{G}_g) = \gamma(\mathcal{G}_g) = |\mathcal{Y}| = M$, where $|A|$ is the cardinality of the set $A$. Since the normal product of a graph of isolated cliques is again a graph of isolated cliques, one has $C(g) = \log M$, cf. [10, p. 2209]. For such channels, separating source and channel coding is optimal [10, Prop. 1].

Let $X$ and $Z$ be two RVs with a joint distribution having support $\mathcal{S}_1 := \{(x,z) \in \mathcal{X} \times \mathcal{Z}\colon \Pr(X = x, Z = z) > 0\}$.

Definition 2. Let $\mathcal{G}_{(X,Z)} := (\mathcal{X}, E_{(X,Z)})$ be the characteristic graph of $(X,Z)$, where $\{x,x'\} \in E_{(X,Z)}$, if

$$\forall z \in \mathcal{Z}\colon \Pr(X = x, Z = z)\Pr(X = x', Z = z) = 0, \tag{3}$$

i.e., if there is no $z$ such that $(x,z) \in \mathcal{S}_1$ and $(x',z) \in \mathcal{S}_1$.
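Definition 2 depends only on the support of the joint distribution, which the following Python sketch makes explicit. The function name `characteristic_edges` and the three-state support are hypothetical and chosen purely for illustration.

```python
from itertools import combinations

def characteristic_edges(support, alphabet):
    """Edges {x, x'} such that no z has both (x, z) and (x', z) in the support."""
    # z_of[x] collects every side-information value z that can occur together with x.
    z_of = {x: {z for (xx, z) in support if xx == x} for x in alphabet}
    return {frozenset((x, xp))
            for x, xp in combinations(sorted(alphabet), 2)
            if not (z_of[x] & z_of[xp])}

# Hypothetical support S_1 on X = {1, 2, 3} and Z = {'a', 'b'}.
support = {(1, 'a'), (2, 'b'), (3, 'a'), (3, 'b')}
edges = characteristic_edges(support, {1, 2, 3})
```

In this toy support, only states 1 and 2 never share a value of $z$, so the side information always tells them apart and they are the only connected pair.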

In other words, the characteristic graph connects two vertices, if the side information $Z$ distinguishes between them. The characteristic graph is the complement of the graph defined by Witsenhausen [5]. It determines the smallest number of messages that the transmitter must send to the receiver, such that the latter can reconstruct $X$ with the help of the side information $Z$. For a single transmission, the required number of messages is the clique partition number $\gamma(\mathcal{G}_{(X,Z)})$. For $K$ independent instances of $(X,Z)$, one requires the $K$-fold co-normal product of $\mathcal{G}_{(X,Z)}$ with itself: $\mathcal{G}^K_{(X,Z)} := (\mathcal{X}^K, E^K_{(X,Z)})$, where $\{x_1^K, x_1'^K\} \in E^K_{(X,Z)}$ if $\{x_i, x_i'\} \in E_{(X,Z)}$ for at least one $i = 1,\dots,K$. In particular, $\mathcal{G}^K_{(X,Z)} = \mathcal{G}_{(X,Z)^{(K)}}$, the characteristic graph of the $K$-fold blocked process. The number of bits required to convey $K$ instances is thus $\log \gamma(\mathcal{G}^K_{(X,Z)}) \le K \log \gamma(\mathcal{G}_{(X,Z)})$.

The characteristic graph $\mathcal{G}_{(X,Z)}$ depends only on the source and connects messages $X$ that the channel may confuse, given the receiver has side information $Z$. The confusion graph $\mathcal{G}_W$ depends only on the channel and connects messages that the channel confuses. If the edge set of the latter is a subset of the edge set of the former, the channel confuses only messages that can be distinguished by incorporating the side information. This is the statement of

Proposition 1. $E_W \subseteq E_{(X,Z)} \Leftrightarrow H(X \mid Y, Z) = 0$.

Proposition 1, proved in Section VI-A, generalizes easily to multiple channel uses by considering the corresponding graph products.
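Proposition 1 can be checked numerically on small examples. The sketch below assumes NumPy; the joint distribution `Q`, the two deterministic channels, and all function names are hypothetical choices for illustration and do not come from the paper.

```python
import numpy as np
from itertools import combinations

def cond_entropy_x_given_yz(Q, W):
    """H(X | Y, Z) in nats for joint pmf Q[x, z] and channel W[x, y]."""
    joint = Q[:, None, :] * W[:, :, None]        # joint[x, y, z] = Q[x, z] * W[x, y]
    marg = joint.sum(axis=0, keepdims=True)      # Pr(Y = y, Z = z)
    mask = joint > 0
    ratio = joint[mask] / np.broadcast_to(marg, joint.shape)[mask]
    return float(-(joint[mask] * np.log(ratio)).sum())

def edges_subset(Q, W):
    """True iff E_W is a subset of E_(X,Z)."""
    for x, xp in combinations(range(Q.shape[0]), 2):
        confused = np.any((W[x] > 0) & (W[xp] > 0))              # {x, x'} in E_W
        distinguished = not np.any((Q[x] > 0) & (Q[xp] > 0))     # {x, x'} in E_(X,Z)
        if confused and not distinguished:
            return False
    return True

# Hypothetical source: the side information separates states 0 and 1 only.
Q = np.array([[0.25, 0.0], [0.0, 0.25], [0.25, 0.25]])
W_good = np.array([[1, 0], [1, 0], [0, 1]], dtype=float)   # lumps {0, 1}
W_bad = np.array([[1, 0], [0, 1], [1, 0]], dtype=float)    # lumps {0, 2}
h_good = cond_entropy_x_given_yz(Q, W_good)
h_bad = cond_entropy_x_given_yz(Q, W_bad)
```

Lumping the pair that the side information separates leaves no residual uncertainty, whereas lumping states 0 and 2 leaves a positive conditional entropy, in line with both directions of the proposition.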

III. Graph-Based Lossless Markov Lumpings 

We use results from zero-error information theory to construct a lumping of a Markov chain such that the original Markov chain can be recovered without error. To this end, we assume that, for the reconstruction of $X_n$, the receiver has the previous state $X_{n-1}$ as side information. This temporal side information determines the characteristic graph. A clique partition of this graph defines a lumping function $g$, whose confusion graph (Definition 1) is a subgraph of the Markov chain's characteristic graph. Then, Proposition 1 guarantees that the original chain can be perfectly reconstructed from its initial state and the lumped process. The remainder of this section makes these statements precise.

Definition 3. Let $\mathcal{G}_{\mathbf{X}} := (\mathcal{X}, E_{\mathbf{X}})$ be the characteristic graph of $\mathbf{X}$, where

$$\{x_1, x_2\} \in E_{\mathbf{X}} \Leftrightarrow \forall x \in \mathcal{X}\colon A_{x,x_1} \cdot A_{x,x_2} = 0. \tag{4}$$

In other words, the characteristic graph of a Markov chain connects two states, if every state can only access one of them. Since the Markov chains considered in this work are irreducible, the invariant distribution vector is positive and Definition 3 coincides with Definition 2 for a source with side information $X_{n-1}$.

Example 1. Consider the Markov chain in Fig. 1. Its characteristic graph has edge set $E_{\mathbf{X}} = \{\{1,2\},\{3,4\}\}$. Both edges are cliques, and together they partition $\mathcal{X}$.
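The characteristic graph of Definition 3 is straightforward to compute from an adjacency matrix. The sketch below assumes NumPy. Because the transition graph of Fig. 1 is not reproduced in the text, the matrix `A` is a hypothetical four-state chain chosen to be consistent with Example 1 (state 1 can access state 3, and every state has two successors); it is not read off the figure.

```python
import numpy as np
from itertools import combinations

def markov_characteristic_edges(A):
    """Edges {x1, x2} such that no state x accesses both x1 and x2 (Definition 3)."""
    n = A.shape[0]
    return {frozenset((x1, x2))
            for x1, x2 in combinations(range(n), 2)
            if not np.any(A[:, x1] * A[:, x2])}

# Hypothetical adjacency matrix; states 1..4 are stored as indices 0..3.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1]])
# Shift back to the 1-based state labels used in the text.
edges = {frozenset(v + 1 for v in e) for e in markov_characteristic_edges(A)}
```

For this assumed matrix the resulting edge set agrees with the one given in Example 1.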

Choose an arbitrary clique partition of $\mathcal{G}_{\mathbf{X}}$, enumerate the cliques, and define $g$ such that it maps each vertex in $\mathcal{X}$ to the index of its containing clique. This way, $g$ assigns different values to vertices within different cliques. According to Lemma 1, the confusion graph $\mathcal{G}_g$ of $g$ consists exactly of the cliques of the chosen clique partition of $\mathcal{G}_{\mathbf{X}}$, only that these cliques are isolated in $\mathcal{G}_g$. This ensures that $E_g \subseteq E_{\mathbf{X}}$. Let $Y_n := g(X_n)$ define the lumped process $\mathbf{Y}$. Hence, by Proposition 1 we have

$$H(X_n \mid Y_n, X_{n-1}) = 0. \tag{5}$$
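The construction above can be sketched in Python as follows. The greedy search is a stand-in chosen for brevity: it returns some clique partition, not necessarily a smallest one. The matrix `A` is a hypothetical chain consistent with Example 1, and all function names are assumptions.

```python
import numpy as np

def greedy_clique_partition(A):
    """Greedy (not necessarily minimal) clique partition of the characteristic graph."""
    n = A.shape[0]
    adj = (A.T @ A) == 0          # adj[u, v]: no state accesses both u and v
    cliques = []
    for v in range(n):
        for c in cliques:
            if all(adj[v, u] for u in c):
                c.append(v)
                break
        else:
            cliques.append([v])
    return cliques

def lumping_from_partition(cliques, n):
    """Map each state to the index of the clique that contains it."""
    g = np.empty(n, dtype=int)
    for idx, c in enumerate(cliques):
        g[c] = idx
    return g

def is_single_entry(A, g):
    """True iff every state accesses at most one state per lumped symbol."""
    onehot = np.eye(g.max() + 1, dtype=int)[g]   # onehot[x', y] = 1 iff g(x') = y
    return bool(np.all(A @ onehot <= 1))

# Hypothetical adjacency matrix (indices 0..3 stand for states 1..4).
A = np.array([[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1]])
cliques = greedy_clique_partition(A)
g = lumping_from_partition(cliques, A.shape[0])
```

The check `is_single_entry(A, g)` is the combinatorial counterpart of (5): given the previous state, each lumped symbol leaves exactly one candidate for the current state.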

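Let $\bar{H}(\mathbf{X})$ and $\bar{H}(\mathbf{Y})$ be the entropy rates of $\mathbf{X}$ and $\mathbf{Y}$, respectively. It is easy to see that the tuple $(P, g)$ fulfils the single-entry property [3, Def. 10]. Thus, the lumping is lossless in the sense of a vanishing information loss rate, i.e.,

$$\bar{H}(\mathbf{X} \mid \mathbf{Y}) := \lim_{n\to\infty} \frac{1}{n} H(X_1^n \mid Y_1^n) = \bar{H}(\mathbf{X}) - \bar{H}(\mathbf{Y}) = 0. \tag{6}$$

This follows from the chain rule (a), the fact that conditioning reduces entropy (b), and stationarity of $\mathbf{X}$ (c):

$$\bar{H}(\mathbf{X} \mid \mathbf{Y}) \overset{(a)}{=} \lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} H(X_i \mid X_1^{i-1}, Y_1^n) \tag{7a}$$

$$\overset{(b)}{\le} \lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} H(X_i \mid Y_i, X_{i-1}) \tag{7b}$$

$$\overset{(c)}{=} H(X_2 \mid Y_2, X_1). \tag{7c}$$

The last term vanishes because $g$ is such that (5) holds for all $n$. With this we have proven

Corollary 1. If, for a given Markov chain $\mathbf{X}$, the lumping function $g$ satisfies $E_g \subseteq E_{\mathbf{X}}$, then the lumping is lossless, i.e., $\bar{H}(\mathbf{X} \mid \mathbf{Y}) = 0$.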

Not only is the proposed lumping method lossless in the sense of Corollary 1, the original Markov chain can be perfectly reconstructed from its initial state $X_1$ and from the lumped process $\mathbf{Y}$. The initial state $X_1$ and the state $Y_2$ of the lumped process together determine the state $X_2$ of the original Markov chain. Then, $X_2$ acts as side information to reconstruct $X_3$ from $Y_3$, etc.
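The recursive reconstruction can be written as a short decoder. The sketch below assumes NumPy; the chain `A`, the lumping `g`, and the function name `reconstruct` are hypothetical, with `A` chosen to be consistent with Example 1.

```python
import numpy as np

def reconstruct(A, g, x1, y):
    """Recover x_1, ..., x_n from the initial state x1 and lumped symbols y_2, ..., y_n."""
    path = [x1]
    for yn in y:
        prev = path[-1]
        # States reachable from the previous state that map to the observed symbol.
        candidates = [x for x in range(A.shape[0]) if A[prev, x] and g[x] == yn]
        if len(candidates) != 1:
            raise ValueError("g is not lossless for this chain, or y is not realizable")
        path.append(candidates[0])
    return path

# Hypothetical chain and lumping (indices 0..3 stand for states 1..4).
A = np.array([[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1]])
g = [0, 0, 1, 1]

# Simulate a realizable path, lump it, and decode it again.
rng = np.random.default_rng(0)
x = [0]
for _ in range(20):
    x.append(int(rng.choice(np.flatnonzero(A[x[-1]]))))
y = [g[s] for s in x[1:]]
decoded = reconstruct(A, g, x[0], y)
```

The decoder needs only the locations of the non-zero transitions, not their probabilities, which mirrors the universality of the method.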

We investigate the size $M$ of the output alphabet required for $g$ to be lossless. An optimal lumping function $g$ induces the smallest possible partition of $\mathcal{X}$, i.e., $M = \gamma(\mathcal{G}_g) = \gamma(\mathcal{G}_{\mathbf{X}})$. From Definition 3 it follows that no two states accessible from a given state $x \in \mathcal{X}$ can be connected in $\mathcal{G}_{\mathbf{X}}$. Hence, if $d_{\max}$ is the maximum out-degree of the transition graph associated with $P$, i.e.,

$$d_{\max} := \max_{x \in \mathcal{X}} \sum_{x' \in \mathcal{X}} A_{x,x'}, \tag{8}$$

then every clique partition of $\mathcal{G}_{\mathbf{X}}$ contains at least $d_{\max}$ cliques. We recover

Proposition 2 ([10, Prop. 3]). $M \ge d_{\max}$.

Witsenhausen [5, Prop. 1] showed that this lower bound can be achieved using the side information, which is available at both ends. The achievable scheme requires that, for every state of the side information $X_{n-1}$, a separate lumping function is used. Our restriction to a single lumping function leads to an output alphabet size generally larger than $d_{\max}$. However, if $A$ is sufficiently sparse, then the presence of side information at the receiver helps to make the output alphabet size still strictly smaller than $N$.
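The gap between $d_{\max}$ and the alphabet size of a single lumping function can be explored by exhaustive search on tiny chains. The sketch below assumes NumPy and enumerates all labelings, which is only feasible for very small $N$; both adjacency matrices and the function names are hypothetical.

```python
import numpy as np
from itertools import product

def d_max(A):
    """Maximum out-degree of the transition graph, as in (8)."""
    return int(A.sum(axis=1).max())

def min_lossless_alphabet(A):
    """Brute-force clique partition number of the characteristic graph (tiny N only)."""
    n = A.shape[0]
    conflict = (A.T @ A) > 0      # conflict[u, v]: some state accesses both u and v
    for m in range(1, n + 1):
        for labels in product(range(m), repeat=n):
            ok = all(not conflict[u, v]
                     for u in range(n) for v in range(u + 1, n)
                     if labels[u] == labels[v])
            if ok:
                return m
    return n

# Hypothetical chain consistent with Example 1: the bound of Proposition 2 is met.
A = np.array([[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1]])
# Hypothetical three-state chain where a single lumping function needs more symbols.
B = np.array([[1, 1, 0], [0, 1, 1], [1, 0, 1]])
```

For `B`, every pair of states shares a predecessor, so no two states can be lumped even though each state has only two successors.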

Example 2. Consider the Markov chain in Fig. 1 and assume that all transitions have probability 0.5. By symmetry, it follows that $H(X_n) = \log N = \log 4$ and $\bar{H}(\mathbf{X}) = \log M = \log 2$. The output alphabet size is optimal in terms of Proposition 2: $H(Y_n) = \bar{H}(\mathbf{Y}) = \log M = \log d_{\max} = \log 2$.

The proposed lumping method depends only on the location of zeros in the adjacency matrix $A$. It follows that the method is universal in the sense that the obtained lumping function $g$ is lossless for every Markov chain with adjacency matrix $A$. Moreover, $g$ is lossless for every stationary process for which the non-zero one-step transition probabilities are modelled by $A$. Equations (7) do not require Markovity of $\mathbf{X}$, whence Corollary 1 remains valid. However, our lumping method is only useful for Markov chains (or stationary processes) with a deterministic temporal structure, i.e., for sparse matrices $A$.

Example 3. Suppose that $P$ is a positive matrix, collecting the conditional probability distribution of two consecutive samples of $\mathbf{X}$. Hence $A$ is a matrix of ones, and the edge set $E_{\mathbf{X}}$ of the characteristic graph $\mathcal{G}_{\mathbf{X}}$ is empty. Thus, $M = \gamma(\mathcal{G}_{\mathbf{X}}) = N$. The only lossless lumping functions are permutations, hence lumping does not reduce the alphabet size.

Note finally that instead of defining $g$ via a clique partition of $\mathcal{G}_{\mathbf{X}}$, one can also define a stochastic lumping $W$ via a clique covering of $\mathcal{G}_{\mathbf{X}}$. This still ensures that $E_W \subseteq E_{\mathbf{X}}$ holds and that the statement of Corollary 1 remains valid. While clique covering leads to additional freedom in the design of the lumping, it does not reduce the required output alphabet size compared to clique partitioning: If two cliques $S_1$ and $S_2$ cover a subset of the vertices $\mathcal{X}$, then the two cliques $S_1$ and $S_2 \setminus S_1$ partition it.

IV. Graph-Based Lossy Markov Lumpings 
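We generalize the characteristic graph of the Markov chain by eliminating edges from its transition graph (i.e., ones in its adjacency matrix $A$) if the transition probabilities fall below a certain threshold:

Definition 4. For $\varepsilon \ge 0$, the $\varepsilon$-characteristic graph of $\mathbf{X}$ is the graph $\mathcal{G}_\varepsilon := (\mathcal{X}, E_\varepsilon)$, where

$$\{x_1, x_2\} \in E_\varepsilon \Leftrightarrow \forall x \in \mathcal{X}\colon \lceil P_{x,x_1} - \varepsilon \rceil \cdot \lceil P_{x,x_2} - \varepsilon \rceil = 0. \tag{9}$$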
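The thresholding in Definition 4 amounts to keeping only transitions whose probability exceeds $\varepsilon$. The following sketch assumes NumPy; the three-state transition matrix `P` and the function names are hypothetical.

```python
import numpy as np

def eps_adjacency(P, eps):
    """Thresholded adjacency matrix ceil(P - eps): 1 exactly where P > eps."""
    return (P > eps).astype(int)

def eps_characteristic_adj(P, eps):
    """Boolean adjacency matrix of the eps-characteristic graph."""
    A = eps_adjacency(P, eps)
    adj = (A.T @ A) == 0          # no state accesses both endpoints with prob. > eps
    np.fill_diagonal(adj, False)
    return adj

# Hypothetical chain with one dominant and one rare transition per state.
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.9, 0.1],
              [0.1, 0.0, 0.9]])
adj0 = eps_characteristic_adj(P, 0.0)   # ordinary characteristic graph
adj1 = eps_characteristic_adj(P, 0.1)   # rare transitions removed
```

In this example the ordinary characteristic graph has no edges, while removing the rare transitions yields a complete graph, so the required alphabet size drops from 3 to 1 at the cost of a non-zero information loss.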

Definition 4 is equivalent to Definition 3 if $A$ is defined by $A_{x,x'} := \lceil P_{x,x'} - \varepsilon \rceil$. Decreasing the number of ones in $A$ can only increase the number of edges in the characteristic graph, which in turn can only make the cliques larger and the clique partition number smaller. Hence, $E_{\mathbf{X}} \subseteq E_\varepsilon$ and $\gamma(\mathcal{G}_{\mathbf{X}}) \ge \gamma(\mathcal{G}_\varepsilon)$. By eliminating edges, one may trade information loss for alphabet size. For the former, in Section VI-B we prove a bound depending on $\varepsilon$, the number $N$ of vertices, and the cardinality $M$ of the output alphabet:

Proposition 3. Take $\varepsilon < 1/N$ and $E_g \subseteq E_\varepsilon$, then

$$\bar{H}(\mathbf{X} \mid \mathbf{Y}) \le (N - M)\,\varepsilon\,(1 - \log \varepsilon) \le N H_2(\varepsilon), \tag{10}$$

where $H_2(p) := -p \log p - (1-p) \log(1-p)$. The first inequality already holds for $\varepsilon < 1/e$.

Applying Proposition 3 to $\varepsilon = 0$ recovers Corollary 1. The following example illustrates that if the entropy rate of $\mathbf{X}$ falls below the bound in Proposition 3, the lumped process $\mathbf{Y}$ can become trivial.


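Example 4. Suppose that

$$P = \begin{bmatrix} 1-\varepsilon & \varepsilon \\ \varepsilon & 1-\varepsilon \end{bmatrix}. \tag{11}$$

It follows that $\mu_1 = \mu_2 = 1/2$ and that $\bar{H}(\mathbf{X}) = H_2(\varepsilon)$. Moreover, as $\mathcal{G}_\varepsilon$ is fully connected, $g$ is constant with $M = 1$. Thus, $\mathbf{Y}$ is a constant process and $\bar{H}(\mathbf{Y}) = 0$.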
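For Example 4 the two bounds of Proposition 3 can be compared with the actual information loss. The sketch below assumes NumPy and picks the value $\varepsilon = 0.05$ arbitrarily; the helper names are hypothetical.

```python
import numpy as np

def h2(p):
    """Binary entropy function in nats."""
    return float(-p * np.log(p) - (1 - p) * np.log(1 - p))

def entropy_rate(P, mu):
    """Entropy rate in nats of a stationary Markov chain with invariant distribution mu."""
    logP = np.log(np.where(P > 0, P, 1.0))   # zero entries contribute 0 * log(1) = 0
    return float(-(mu[:, None] * P * logP).sum())

eps = 0.05                                    # arbitrary value below 1/N = 1/2
P = np.array([[1 - eps, eps], [eps, 1 - eps]])
mu = np.array([0.5, 0.5])
N, M = 2, 1

loss = entropy_rate(P, mu) - 0.0              # Y is constant, so its entropy rate is 0
bound1 = (N - M) * eps * (1 - np.log(eps))    # first bound in (10)
bound2 = N * h2(eps)                          # second bound in (10)
```

Here the loss equals the full entropy rate $H_2(\varepsilon)$ of the chain, and it stays below both bounds.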


Reconstructing $\mathbf{X}$ from $\mathbf{Y}$ (with small probability of error) requires reconstruction methods more sophisticated than those for the lossless lumping method introduced in Section III. Given knowledge of the previous state $X_{n-1}$ and the current lumped state $Y_n$, the current state $X_n$ cannot be reconstructed without error. Hence, the side information used for reconstructing the next state might not be correct, which leads to error propagation.


V. A Source Coding Perspective on Lossless 
Markov Lumpings 

The intended application of the lumping method introduced in Section III, model reduction in speech/language processing [1] or systems biology [2], imposes several restrictions. The lumping is a time-invariant, preferably deterministic mapping from the large alphabet $\mathcal{X}$ to a smaller alphabet $\mathcal{Y}$ and operates on a symbol-by-symbol basis in order to represent a partition of the original alphabet. These restrictions (stateless, fixed-length, and symbol-by-symbol) make our proposed method an inefficient source code. Despite this apparent incompatibility, we critically evaluate our lossless lumping method from a source coding perspective.

First, our lumping method can be used as a (universal) preprocessing step, after which more sophisticated compression schemes follow. For example, it can be easily extended to a variable-length symbol-by-symbol scheme by, e.g., optimal Huffman coding of the lumped states.

Second, we may still require the lumping to be stateless and fixed-length, but define the lumping function $g$ on the $K$-fold Cartesian product $\mathcal{X}^K$. Hence, $g$ lumps sequences of length $K$ rather than states. Due to the deterministic temporal structure of $\mathbf{X}$, the alphabet size for lumping these length-$K$ sequences is not larger than the number of realizable sequences of this length. In other words, our scheme is at least as good as, and asymptotically equivalent to, any fixed-length, lossless coding scheme that en-/decodes sequences independently of each other. To show this, let $\lambda$ be the largest eigenvalue of the adjacency matrix $A$. If the Markov chain $\mathbf{X}$ has adjacency matrix $A$, the logarithm of $\lambda$ bounds the entropy rate of $\mathbf{X}$ from above, i.e., $\bar{H}(\mathbf{X}) \le \log \lambda$ [?]. In Section VI-C we prove

Proposition 4. For each $K$, let $g_K\colon \mathcal{X}^K \to \mathcal{Y}_K$ be the optimal lumping function for the Markov chain $\mathbf{X}^{(K)}$, i.e., it induces the smallest clique partition of its characteristic graph $\mathcal{G}_K$. Let $M_K := |\mathcal{Y}_K|$. Let the set of realizable states of $\mathbf{X}^{(K)}$ be

$$\mathcal{S}_K := \{x \in \mathcal{X}^K\colon \Pr(X_1^K = x) > 0\}. \tag{12}$$

Then, $M_K \le |\mathcal{S}_K|$ and

$$\lim_{K\to\infty} \frac{\log M_K}{K} = \log \lambda. \tag{13}$$

Example 5. If $K = 1$, then $M = M_1 \le |\mathcal{S}_1| = |\mathcal{X}| = N$. If $K = 2$, then $M_2 \le |\mathcal{S}_2| = \sum_{x,x'} A_{x,x'} \le N^2$.

While $M_K \le |\mathcal{S}_K|$, especially for small $K$ and sparse $A$, the inequality may be strict. This advantage disappears for increasing $K$ due to the Markov property, and the required alphabet size approaches the number of realizable length-$K$ sequences, which for large $K$ behaves like $\lambda^K$ [?]. Thus, while our lossless lumping method is asymptotically optimal in the sense of Proposition 4, for the intended application of reducing the alphabet it seems to be most efficient when applied symbol-by-symbol.
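The growth of the number of realizable sequences can be checked numerically. The sketch below assumes NumPy and uses a hypothetical adjacency matrix consistent with Example 1. It counts realizable length-$K$ sequences as paths in the transition graph, which is valid because all states of an irreducible stationary chain are realizable.

```python
import numpy as np

def num_realizable(A, K):
    """|S_K|: number of length-K sequences with positive probability."""
    ones = np.ones(A.shape[0], dtype=np.int64)
    # Entry (x, x') of A^(K-1) counts the paths with K states from x to x'.
    return int(ones @ np.linalg.matrix_power(A, K - 1) @ ones)

def log_spectral_radius(A):
    """Logarithm of the largest eigenvalue magnitude of A."""
    return float(np.log(np.max(np.abs(np.linalg.eigvals(A)))))

# Hypothetical adjacency matrix (indices 0..3 stand for states 1..4).
A = np.array([[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1]], dtype=np.int64)
sizes = [num_realizable(A, K) for K in (1, 2, 3)]
rate_50 = np.log(num_realizable(A, 50)) / 50
```

For this matrix every state has two successors, so $|\mathcal{S}_K|$ doubles with each additional symbol and the normalized logarithm approaches $\log \lambda = \log 2$.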


VI. Proofs 

A. Proof of Proposition 1

Let $Q_{x,z} := \Pr(X = x, Z = z)$. First, assume that $H(X \mid Y, Z) > 0$. There exist triples $(x,y,z)$ and $(x',y,z)$ such that

$$\Pr(X = x, Y = y, Z = z) = Q_{x,z} W_{x,y} > 0 \tag{14a}$$

and

$$\Pr(X = x', Y = y, Z = z) = Q_{x',z} W_{x',y} > 0. \tag{14b}$$

Hence, each term of the products on the right-hand sides above must be positive, from which $Q_{x,z} Q_{x',z} > 0$ and $\lceil W_{x,y} \rceil \lceil W_{x',y} \rceil = 1$ follow. As a consequence, by the definitions of the channel confusion graph and the characteristic graph of $(X,Z)$, we have $\{x,x'\} \in E_W$ and $\{x,x'\} \notin E_{(X,Z)}$. Thus, $E_W \not\subseteq E_{(X,Z)}$.

Second, assume that $E_W \not\subseteq E_{(X,Z)}$. Then, there exists $\{x,x'\} \in [\mathcal{X}]^2$ such that $\{x,x'\} \in E_W$ and $\{x,x'\} \notin E_{(X,Z)}$. It follows that there exists at least one $z'$ such that $Q_{x,z'} Q_{x',z'} > 0$, and at least one $y'$ such that $W_{x,y'} W_{x',y'} > 0$. Hence, the two probabilities in equations (14) are positive for $z = z'$ and $y = y'$. Thus, $H(X \mid Y, Z) > 0$. □


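B. Proof of Proposition 3

That $\bar{H}(\mathbf{X} \mid \mathbf{Y}) \le H(X_2 \mid Y_2, X_1)$ follows from (7). If we define $R_{x,y} := \sum_{x' \in g^{-1}(y)} P_{x,x'}$, then we get

$$H(X_2 \mid Y_2, X_1 = x) = -\sum_{y \in \mathcal{Y}} \sum_{x' \in g^{-1}(y)} P_{x,x'} \log \frac{P_{x,x'}}{R_{x,y}}. \tag{15}$$

The assumption $E_g \subseteq E_\varepsilon$ implies that $g^{-1}(y)$ is a clique in $\mathcal{G}_\varepsilon$, whence each $x \in \mathcal{X}$ can access at most one element in $g^{-1}(y)$ with a probability larger than $\varepsilon$. Hence, let $\hat{x} \in g^{-1}(y)$ be such that for all other $x'' \in g^{-1}(y) \setminus \{\hat{x}\}$, $P_{x,x''} \le \varepsilon$. Thus,

$$R_{x,y} \le P_{x,\hat{x}} + \varepsilon \left( |g^{-1}(y)| - 1 \right). \tag{16}$$

We derive the first inequality in (10):

$$H(X_2 \mid Y_2, X_1 = x) = \sum_{y \in \mathcal{Y}} P_{x,\hat{x}} \log \frac{R_{x,y}}{P_{x,\hat{x}}} - \sum_{y \in \mathcal{Y}} \sum_{x' \in g^{-1}(y) \setminus \{\hat{x}\}} P_{x,x'} \log \frac{P_{x,x'}}{R_{x,y}}$$

$$\overset{(a)}{\le} \sum_{y \in \mathcal{Y}} \left( R_{x,y} - P_{x,\hat{x}} \right) - \sum_{y \in \mathcal{Y}} \sum_{x' \in g^{-1}(y) \setminus \{\hat{x}\}} \varepsilon \log \frac{\varepsilon}{R_{x,y}}$$

$$\overset{(b)}{\le} \sum_{y \in \mathcal{Y}} \left( R_{x,y} - P_{x,\hat{x}} \right) - \sum_{y \in \mathcal{Y}} \sum_{x' \in g^{-1}(y) \setminus \{\hat{x}\}} \varepsilon \log \varepsilon$$

$$\overset{(c)}{\le} \sum_{y \in \mathcal{Y}} \varepsilon \left( |g^{-1}(y)| - 1 \right) (1 - \log \varepsilon)$$

$$= (N - M)\,\varepsilon\,(1 - \log \varepsilon),$$

where (a) is because $\log(1+x) \le x$, for $x' \ne \hat{x}$, $P_{x,x'} \le \varepsilon$ and $-p \log p$ increases on $[0, 1/e]$, (b) follows because $R_{x,y} \le 1$, and (c) is due to (16).

For the second inequality in (10), because $\log(1+x) \le x$, we have $N\varepsilon(1-\varepsilon) \le -N(1-\varepsilon)\log(1-\varepsilon)$. By assumption, $\varepsilon N < 1$, whence $N\varepsilon(1-\varepsilon) \ge (N-M)\varepsilon$, for all $M \ge 1$. Thus,

$$(N - M)\,\varepsilon\,(1 - \log \varepsilon) = (N - M)\varepsilon - (N - M)\varepsilon \log \varepsilon$$

$$\le N\varepsilon(1 - \varepsilon) - N\varepsilon \log \varepsilon$$

$$\le -N(1 - \varepsilon)\log(1 - \varepsilon) - N\varepsilon \log \varepsilon$$

$$= N H_2(\varepsilon). \qquad \square$$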


C. Proof of Proposition 4

The set of all unrealizable length-$K$ sequences $\mathcal{X}^K \setminus \mathcal{S}_K$ is a clique in $\mathcal{G}_K$ and every vertex in this clique is connected to every vertex outside of it. To see this fact, take $x \in \mathcal{X}^K$ such that $\Pr(X_1^K = x) = 0$. Since this state cannot be accessed, w.l.o.g. the $x$-th column of the corresponding adjacency matrix is zero. This means that, for every $x' \in \mathcal{X}^K$, realizable or not, $\{x,x'\} \in E_K$.

Since $\mathcal{X}^K \setminus \mathcal{S}_K$ is a clique, and since every state in this clique is connected to an arbitrary $x \in \mathcal{S}_K$, also $\{x\} \cup (\mathcal{X}^K \setminus \mathcal{S}_K)$ is a clique. A trivial clique partition thus consists of this clique and all the trivial single vertex cliques of vertices in $\mathcal{S}_K \setminus \{x\}$. This clique partition has size $|\mathcal{S}_K|$. Since this clique partition may not be optimal, we get $M_K = \gamma(\mathcal{G}_K) \le |\mathcal{S}_K|$.

For the asymptotic result, note that $\lim_{K\to\infty} (\log M_K)/K$ cannot be smaller than $\bar{H}(\mathbf{X})$. But since $\bar{H}(\mathbf{X}) = \log \lambda$ is achievable, we have $\lim_{K\to\infty} (\log M_K)/K \ge \log \lambda$. Furthermore, the number of realizable length-$K$ sequences of a Markov chain behaves like $\lambda^K$ as $K$ increases. Specifically, $\lim_{K\to\infty} (\log |\mathcal{S}_K|)/K = \log \lambda$ [?]. Together with $M_K \le |\mathcal{S}_K|$, this establishes (13). □

VII. Zero-Error Source Coding of Stationary 
Processes 

Based on the classic papers [4] and [5], most results in zero-error information theory are based on memoryless channels and sources. While there exist extensions to channels with memory, see [6] and the references therein, to the best of the authors' knowledge sources with memory have not been dealt with yet. We believe that applying zero-error information theory to Markov chains motivates such an extension. This section presents a first result.

Assume the source produces two jointly stationary random processes $\mathbf{X}$ and $\mathbf{Z}$, and assume that the support of the marginal distribution is $\mathcal{S}_1 := \{(x,z) \in \mathcal{X} \times \mathcal{Z}\colon \Pr(X_n = x, Z_n = z) > 0\}$. Furthermore, let $\mathcal{S}_K$ be the support of the joint distribution of $K$ samples, i.e., the joint distribution of $(X_1^K, Z_1^K)$. Clearly, $\mathcal{S}_K \subseteq \mathcal{S}_1^K$. We already mentioned that the $K$-fold co-normal product $\mathcal{G}^K_{(X,Z)}$ of $\mathcal{G}_{(X,Z)}$ is the characteristic graph of the $K$-fold blocked source, assuming that the source $(X,Z)$ is iid [8]. We claim that independence is not necessary, but that $\mathcal{S}_K = \mathcal{S}_1^K$ suffices. As soon as $\mathcal{S}_K \subsetneq \mathcal{S}_1^K$, the edge set of $\mathcal{G}_{(X,Z)^{(K)}}$ may become a strict superset of the edge set of $\mathcal{G}^K_{(X,Z)}$. Only deterministic dependence, where not all sequences $(x_1^K, z_1^K)$ are realizable, can reduce the required alphabet size as compared to the iid assumption.

If the receiver obtains the side information via a discrete, memoryless channel, then we get

Proposition 5. Let $\mathbf{X}$ be a stationary stochastic process with support $\mathcal{S}_K$ of the distribution of $X_1^K$ given as in Proposition 4 and let the side information $Z_1^K$ be given via a DMC $W$, i.e.,

$$\Pr(X_1^K = x_1^K, Z_1^K = z_1^K) = \Pr(X_1^K = x_1^K) \cdot \prod_{i=1}^{K} W_{x_i,z_i}. \tag{17}$$

Then, the characteristic graph $\mathcal{G}_{(X,Z)^{(K)}}$ has edge set

$$E_{(X,Z)^{(K)}} = E^K_{(X,Z)} \cup \{\{x,x'\}\colon x \in \mathcal{X}^K, x' \in \mathcal{X}^K \setminus \mathcal{S}_K\}. \tag{18}$$

Proposition 5 states that a deterministic temporal structure of the source can only decrease the clique partition number, making compression more efficient. If $E_{(X,Z)^{(K)}} = [\mathcal{X}^K]^2$ for some $K$, then no information needs to be transmitted because all information about $X_1^K$ is already contained in the side information $Z_1^K$. We believe that this analysis can be extended to more general side information structures and to variable-length zero-error source codes as in [?], [?].

Proof: By Definition 2, $\{x,x'\} \in E_{(X,Z)^{(K)}}$, iff, for all $z \in \mathcal{Z}^K$,

$$\Pr(X_1^K = x, Z_1^K = z) \Pr(X_1^K = x', Z_1^K = z) = 0. \tag{19}$$

With $x_i$ the $i$-th coordinate of $x$, we write

$$\Pr(X_1^K = x, Z_1^K = z) = \Pr(X_1^K = x) \prod_{i=1}^{K} W_{x_i,z_i} \tag{20}$$

and see that (19) holds, iff at least one of the following conditions holds:

$$\Pr(X_1^K = x) = 0, \tag{21a}$$

$$\Pr(X_1^K = x') = 0, \tag{21b}$$

$$\prod_{i=1}^{K} \prod_{j=1}^{K} W_{x_i,z_i} W_{x'_j,z_j} = 0. \tag{21c}$$

Equation (21a) (and, similarly, equation (21b)) implies that if a sequence $x$ is not realizable, then (19) holds for all $x' \in \mathcal{X}^K$. Hence, in $\mathcal{G}_{(X,Z)^{(K)}}$, each unrealizable state $x$ is connected to every other state. With $\mathcal{S}_K$ being the set of realizable sequences, we get $\{\{x,x'\}\colon x \in \mathcal{X}^K, x' \in \mathcal{X}^K \setminus \mathcal{S}_K\} \subseteq E_{(X,Z)^{(K)}}$.

We may assume w.l.o.g. that $\mathcal{S}_1 = \mathcal{X}$, i.e., that all states are realizable. Then, since the $K$-fold co-normal product $\mathcal{G}^K_{(X,Z)}$ of $\mathcal{G}_{(X,Z)}$ is the characteristic graph of the source emitting $(X,Z)$ iid, we have $\{x,x'\} \in E^K_{(X,Z)}$, iff, for all $z \in \mathcal{Z}^K$,

$$\prod_{j=1}^{K} \prod_{i=1}^{K} \mu_{x_i} \mu_{x'_j} W_{x_i,z_i} W_{x'_j,z_j} = 0. \tag{22}$$

Since we assume that $\mu > 0$, this is equivalent to (21c). Hence, also $E^K_{(X,Z)} \subseteq E_{(X,Z)^{(K)}}$. This covers all cases of (21). ■